Phonemes based detection of Parkinson’s disease for telehealth applications

Dysarthria is an early symptom of Parkinson’s disease (PD) which has been proposed for detection and monitoring of the disease with potential for telehealth. However, with inherent differences between voices of different people, computerized analyses have not demonstrated high performance that is consistent across different datasets. The aim of this study was to improve the performance in detecting PD voices and test this with different datasets. This study has investigated the effectiveness of three groups of phoneme parameters, i.e. voice intensity variation, perturbation of glottal vibration, and apparent vocal tract length (VTL), for differentiating people with PD from healthy subjects using two public databases. The parameters were extracted from five sustained phonemes, /a/, /e/, /i/, /o/, and /u/, recorded from 50 PD patients and 50 healthy subjects of the PC-GITA dataset. The features were statistically investigated, and then classified using a Support Vector Machine (SVM). This was repeated on the Viswanathan dataset with smartphone-based recordings of /a/, /o/, and /m/ of 24 PD and 22 age-matched healthy people. VTL parameters gave the highest difference between voices of people with PD and healthy subjects; classification accuracy with the five vowels of the PC-GITA dataset was 84.3% while the accuracy for other features was between 54% and 69.2%. The accuracy for Viswanathan’s dataset was 96.0%. This study has demonstrated that VTL obtained from the recording of phonemes using a smartphone can accurately identify people with PD. The analysis was fully computerized and automated, and this has the potential for telehealth diagnosis of PD.

[...] involves broad factors such as the psychology, linguistics, and cognitive conditions of patients. On the other hand, phonatory aspects of a sustained phoneme are less influenced by the above factors. Studies have investigated the effectiveness of sustained phoneme parameters in representing the phenomenon of Parkinsonian hypokinetic dysarthria 16,18-21. Most of the studies focused on the parameters that are closely related to impairments in vocal cord vibration. The pitch frequency variation, number of pulses, jitter (perturbation of the glottal vibration period), shimmer (amplitude perturbation of glottal vibration), autocorrelation, and harmonics to noise ratio (HNR/NHR) were used in the authors' previous work 22, as well as in the work of Orozco-Arroyave 23, Behroozi et al. 24, Tsanas and Little 25, Ali et al. 26, Sakar et al. 19, and Rusz et al. 6.
Speech production features extracted from the glottal waveform remove the effect of articulation on the acoustic signal. They approximate the volume velocity of the air flowing through the vocal folds and may have an advantage for the analysis of the pathological voice.
Physiologically, these glottic source features are associated with (1) the frequency, amplitude, symmetry, and periodicity of vocal fold vibration; (2) the competency of glottic closure; and (3) the speed of the vibratory cycle and the ratio of its open to closed phases. Breathiness, the hallmark perceptual voice quality of parkinsonian speech, is associated with incomplete closure of the vocal folds leading to air escape, and thus with the presence of relatively higher noise in the voice, lowered intensity, and a predominance of the open phase of the glottic pulse 8,30. People with PD have higher jitter and lower HNR, associated with aperiodicity of vocal fold vibration and perceived as roughness. Connected speech of people with PD is monotonous and has reduced pitch and loudness variation.
Perez 31 combined the above parameters with thirteen Mel Frequency Cepstral Coefficients (MFCCs) that represent the energy and articulatory positions. Fractal dimension (FD) features that measure the complexity of the signal were used by Viswanathan et al. 32. More recently, multivariate deep features have been found to be effective 33.
Even though the above studies have demonstrated some significant differences between the voice parameters of controls and people with PD, their implementation in a generalized automatic system is not straightforward 34 . There is also evidence of inconsistent results between different studies 32 .
Gillivan-Murphy 35 published preliminary findings based on nasolaryngoscopy which shows that PD voice tremor is not associated with the vocal folds. PD voice tremor is likely to be related to oscillatory movement in structures across the vocal tract rather than just the vocal folds. Furthermore, pronouncing a phoneme is a voluntary activity while PD tremors exist during rest. This may result in an inconsistent appearance of voice tremor in sustained and steady phoneme recordings which is essential for glottal vibration parameters.
The parameters other than the glottal vibration parameters that may potentially be used in PD identification are the parameters related to phonatory airflow and pneumatic pressure to the larynx such as voice intensity and the parameters related to vocal tract muscles such as formants and Vocal Tract Length (VTL) 36,37 .
This study has investigated and compared the effectiveness of three groups of parameters to differentiate the voice of people with PD from that of age-matched healthy participants. These are related to three domains of speech production control: (i) the stability of lung control, (ii) the periodicity and stability of glottal vibration control, and (iii) the stability of vocal tract control. The standard deviation (SD) and range of phoneme intensity were used to measure the stability of lung control, while the shimmer, jitter, SD of pitch, and harmonics parameters were used for the stability of glottal vibration. The vocal tract stability was represented by the SD of the first four formants and the apparent Vocal Tract Length (VTL).
The comparison was examined using a statistical hypothesis test, followed by classification using the Support Vector Machine (SVM). The parameters were extracted from the recordings of sustained phonemes /a/, /e/, /i/, /o/, and /u/. Public database PC-GITA was used for this study. To evaluate the consistency of the method between different datasets, the SVM classifications were also applied to Viswanathan's dataset 38 which contains the recordings of /a/, /o/, and /m/.

Database of recordings.
Two databases of recordings were used in this study. The first is the publicly available database, PC-GITA, provided by Rafael Orozco et al. 23. It contains the recordings of 100 native speakers of Colombian Spanish; 50 of them had been diagnosed with PD, and the other 50 were age- and gender-matched participants with no PD or any other neurological disease symptoms. Table 1 presents the participants' demographic and clinical information. The p-values in the table confirm that there was no significant age difference between the groups and show that the clinical stage was matched between the male and female groups of PD subjects. The speech recordings of the PD subjects were made within 3 h of their morning medication, and hence the subjects were in the pharmacological ON-state. The procedure complied with the Helsinki Declaration and was approved by the Ethics Committee of the Clinica Noel, in Medellin, Colombia.
The recordings were captured in noise-controlled conditions and sampled at 44,100 Hz with 16-bit resolution, using a dynamic omnidirectional microphone (Shure, SM 63L). In this study, we used the recordings of the five vowels /a/, /e/, /i/, /o/, and /u/. The participants produced three repetitions of each sustained vowel, each held as long as possible in one breath, at their natural pitch and loudness. Figure 1 illustrates the waveforms of the five vowels recorded from control and PD patients.
The second is Viswanathan's dataset 32 [...] All people with PD had been diagnosed within the last ten years. Three sustained phonemes, /a/, /o/, and /m/, were recorded from each participant in a noise-restricted environment using a Samson-SE50 microphone. The recordings were stored in single-channel WAV format with a sampling rate of 48 kHz and 16-bit resolution. The sustained phonemes of people with PD in the database were recorded in both the on-state and off-state of medication; for this study, only the on-state recordings were used. Table 2 provides the demographics of the subjects. Detailed information can be found in 22,32.

Parameter extraction. A publicly available speech analysis software, Praat 39, was used to extract speech features from the recordings. Before feature extraction, the recordings were trimmed to a uniform duration of 0.5 s, based on the assumption that vowels correspond to largely stationary signals. The recordings were filtered with a 4th-order IIR Butterworth band-pass filter with a pass band of 50 Hz to 4 kHz.

Figure 1. The waveforms of the five vowels recorded from the control subjects and the PD subjects.

Voice intensity parameters. The voice intensity is controlled by the subglottal pressure, which is in turn controlled by the respiratory muscles and the lung volume 40; thus, it is hypothesized that people with PD will have increased variation and reduced range of the voice intensity. The standard deviation and range of intensity are proportional to the fluctuation of lung pressure during the pronunciation of the sustained phoneme, and may capture the tremor or rigidity due to Parkinson's disease. The standard deviation and range of voice intensity were obtained for each recording. These parameters measure the ability of the subject to keep the air pressure produced by the lungs stable.
The intensity, I (in dB), of an input voice s(t) with a duration of T was calculated using Praat's function with the energy averaging method, as in Eq. (1).
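The energy-averaged intensity described above, and the SD and range derived from it, can be sketched as follows. This is a minimal illustration and not Praat's implementation: it assumes the samples are sound pressures in pascals, uses the 2e-5 Pa reference pressure documented for Praat, and computes the intensity contour over non-overlapping frames whose length (`frame_len`) is a hypothetical choice not taken from the study.

```python
import math
import statistics

P0 = 2e-5  # reference pressure in Pa; assumes the samples are pressures in Pa


def intensity_db(samples):
    """Energy-averaged intensity in dB of one block of samples."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square / (P0 * P0))


def intensity_contour(samples, frame_len):
    """Intensity of consecutive non-overlapping frames (hypothetical framing)."""
    last_start = len(samples) - frame_len
    return [intensity_db(samples[i:i + frame_len])
            for i in range(0, last_start + 1, frame_len)]


def intensity_sd_and_range(samples, frame_len):
    """SD and range (max minus min) of the frame-wise intensity, in dB."""
    contour = intensity_contour(samples, frame_len)
    return statistics.stdev(contour), max(contour) - min(contour)


# Toy signal: a quiet half followed by a half that is ten times larger.
sd, rng = intensity_sd_and_range([0.02] * 50 + [0.2] * 50, frame_len=50)
```

A frame that is entirely silent would make the logarithm undefined, so a real pipeline would need to guard against zero energy; at least two frames are needed for the SD.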
Periodicity and stability of glottal vibration. It is commonly assumed that Parkinsonian dysarthria is affected by the abnormal vibration of the vocal cords, such as inadequate or excessive closing of the vocal cords and irregular or asymmetrical vocal fold movement, as well as a tremor in their muscles 8,34,35. A total of seven parameters related to the periodicity and stability of glottal vibration were extracted from each recording: the absolute jitter (abs), the relative jitter (rel), the absolute shimmer (in dB), the relative shimmer, the standard deviation of the pitch frequency (f0), the HNR, and the NHR.
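The glottal parameters listed above can be sketched with the definitions commonly given for Praat-style voice measures. Whether these match the study's own Eqs. (2) to (7) term by term is an assumption, as is the form NHR = (1 - r) / r. The inputs (a list of glottal periods, a list of peak amplitudes, and the normalized autocorrelation peak `r_peak`) are assumed to have been extracted beforehand.

```python
import math


def jitter_abs(periods):
    """Mean absolute difference between consecutive glottal periods (s)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return sum(diffs) / len(diffs)


def jitter_rel(periods):
    """Absolute jitter divided by the mean period (dimensionless)."""
    return jitter_abs(periods) / (sum(periods) / len(periods))


def shimmer_db(amplitudes):
    """Mean absolute log-ratio of consecutive peak amplitudes, in dB."""
    ratios = [abs(20.0 * math.log10(b / a))
              for a, b in zip(amplitudes, amplitudes[1:])]
    return sum(ratios) / len(ratios)


def shimmer_rel(amplitudes):
    """Mean absolute amplitude difference divided by the mean amplitude."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))


def pitch_sd(periods):
    """Sample SD of the instantaneous pitch f0_i = 1 / T_i, in Hz."""
    f0 = [1.0 / t for t in periods]
    mean = sum(f0) / len(f0)
    return math.sqrt(sum((f - mean) ** 2 for f in f0) / (len(f0) - 1))


def hnr_db(r_peak):
    """HNR in dB from the normalized autocorrelation peak, 0 < r_peak < 1."""
    return 10.0 * math.log10(r_peak / (1.0 - r_peak))


def nhr(r_peak):
    """Noise-to-harmonics ratio, taken here as (1 - r) / r (assumption)."""
    return (1.0 - r_peak) / r_peak
```

For example, `jitter_rel([0.010, 0.011, 0.010, 0.011])` is about 0.095, i.e. consecutive periods differ by roughly 9.5% of the mean period.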
The jitter parameters 41 are related to the time perturbation of the glottal pulse periods, $T_i$. The equations used to calculate the two jitter parameters 41 are given in Eqs. (2) and (3). The shimmer parameters 41 are related to the amplitude perturbation of the glottal cycles and were calculated with Eqs. (4) and (5). The standard deviation of the pitch was calculated from the instantaneous pitch frequency $f_{0,i} = 1/T_i$. The HNR and NHR were calculated from the normalized autocorrelation function of the segment, $R_{xx}$, as described in Eqs. (6) and (7) 42,43, where $R_{xx}[T_0]$ is the peak next to the center of $R_{xx}$ at a distance corresponding to the $T_0$ of the recording.

Formants parameters. Limited control of the speech production process in people with PD leads to disturbances, including changes in phonatory and resonant characteristics 34. The disturbances in the resonant characteristics are due to an inaccurate position of the articulators or a lack of control of the vocal tract muscles. The accuracy of the position and control of the vocal tract muscles can be observed in the fluctuation of the formant frequencies. The stability of vocal tract control in this study was measured with the standard deviation of the first four formants ($F_1$, $F_2$, $F_3$, and $F_4$) and the Vocal Tract Length (VTL). The formants of each recording were extracted in Praat using Burg's method 44 with a maximum formant value of 5.5 kHz, a window length of 25 ms, a time step of 6.25 ms, and a pre-emphasis from 50 Hz. The mean and standard deviation were then calculated for each recording.
Vocal tract length. The other parameter that captures the resonant characteristic of the vocal tube model of voice production is the apparent vocal tract length (VTL). VTL is an estimate of the physical vocal tract length of a subject while producing a specific sound, based on the formant frequencies. VTL has been used in other voice analyses, such as speaker verification 45 and the identification of body measures 36,46.
VTL of each recording was calculated (in cm) from the mean of the four formants, F i , with the formula in Pisanski et al. 36 .
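As an illustration of how an apparent VTL can be derived from a formant, the sketch below uses the quarter-wavelength resonance of a uniform tube closed at one end, VTL(Fi) = (2i - 1) c / (4 Fi). That this is exactly the formula applied in the study is an assumption, as is the speed of sound c = 33,500 cm/s; the formant values in the example are hypothetical.

```python
SPEED_OF_SOUND_CM_S = 33500.0  # assumed speed of sound in the vocal tract


def vtl_from_formant(index, formant_hz, c=SPEED_OF_SOUND_CM_S):
    """Apparent vocal tract length (cm) from the mean of formant F_index.

    Uses the uniform-tube relation F_i = (2i - 1) * c / (4 * VTL).
    """
    if index < 1 or formant_hz <= 0:
        raise ValueError("index must be >= 1 and formant_hz must be positive")
    return (2 * index - 1) * c / (4.0 * formant_hz)


def vtl_features(mean_formants_hz):
    """One VTL(F_i) value per formant, for F1..Fn given in order."""
    return [vtl_from_formant(i, f)
            for i, f in enumerate(mean_formants_hz, start=1)]


# Hypothetical mean formants (Hz) of one sustained-vowel recording.
features = vtl_features([500.0, 1500.0, 2500.0, 3500.0])
```

With these idealized formants, all four estimates agree at 16.75 cm; for real vowels the four values differ because the vocal tract is not a uniform tube, which is why each VTL(Fi) can serve as a separate feature.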
$$I = 10\log_{10}\left(\frac{1}{T P_0^{2}}\int_{0}^{T} s^{2}(t)\,dt\right) \qquad (1)$$

where $I$ is the intensity (in dB) of the voice signal $s(t)$ of duration $T$, and $P_0 = 2 \times 10^{-5}$ Pa is the reference (auditory threshold) pressure used by Praat.

Statistical analysis. The mean and standard deviation of all the parameters were computed for the two groups of the PC-GITA database: PD and CO. The normality of the extracted parameters was examined with the Anderson-Darling test 47. The Mann Whitney U-test 48 was used to compare the speech parameters of the PD and control subjects. A 95% confidence level was used for the analysis, with a p-value < 0.05 taken to indicate that the groups were significantly different. All the statistical analyses were performed using MATLAB R2018b (MathWorks).

Support vector machine classification. The effectiveness of the parameters in classifying PD and control subjects was investigated with a Support Vector Machine (SVM) 49 classifier. The SVM was trained with a Gaussian kernel and validated using "leave-one-out" cross-validation. The Gaussian kernel was selected empirically, since it yielded the best result compared to the other kernels. The inputs to the SVM were the sets of voice parameters and the ten highest-ranked features, selected using the Relief-F algorithm 50.
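The leave-one-out procedure and the reported metrics can be sketched in a few lines. To keep the example dependency-free, a nearest-centroid rule stands in for the Gaussian-kernel SVM; it is not the classifier used in the study, and the feature values below are invented toy numbers.

```python
def fit_centroids(X, y):
    """Stand-in classifier (NOT an SVM): mean feature vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def leave_one_out(X, y, positive=1):
    """Hold out each sample once, train on the rest, and tally the outcomes."""
    tp = tn = fp = fn = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = predict(model, X[i])
        if y[i] == positive:
            tp += pred == positive
            fn += pred != positive
        else:
            tn += pred != positive
            fp += pred == positive
    return {
        "accuracy": (tp + tn) / len(X),
        "sensitivity": tp / (tp + fn),   # true positive rate (PD detected)
        "specificity": tn / (tn + fp),   # true negative rate (controls kept)
    }


# Toy one-dimensional feature vectors; label 1 = PD, label 0 = control.
X = [[16.0], [16.2], [15.9], [17.5], [17.8], [17.6]]
y = [0, 0, 0, 1, 1, 1]
metrics = leave_one_out(X, y)
```

Swapping in a real SVM only changes `fit_centroids` and `predict`; the hold-out loop and the metric definitions stay the same. Both classes must be present in the data, otherwise sensitivity or specificity is undefined.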

Results
Statistical analysis. The Anderson-Darling test confirmed that, except for some VTL parameters, the parameters were not normally distributed. The Mann Whitney U-test, a non-parametric test, was thus used to test for group differences in each of the features. Table 3 provides the statistical distribution (mean ± SD), together with the p-value and effect size of the Mann Whitney U-test between CO and PD, for all the features. The table shows that the parameters of people with PD fluctuated more than those of CO. The voice intensity of people with PD has both a higher SD and a higher range, which indicates their diminished ability to produce sustained phonemes with stable air pressure. The p-values (< 0.05) show that the group difference was significant.
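For readers unfamiliar with the test, the sketch below computes a Mann Whitney U statistic, a two-sided p-value, and an effect size in plain Python. It is not the MATLAB routine used in the study: the p-value uses the normal approximation without tie or continuity correction, and the rank-biserial correlation is one common effect size, chosen here as an assumption because the text does not say which measure was reported.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p, rank-biserial effect size)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = midranks(list(group_a) + list(group_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0  # pairs where a exceeds b
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))      # normal approximation
    effect = (u1 - u2) / (n1 * n2)              # in [-1, 1]; > 0 if a is larger
    return u, p, effect


# Hypothetical jitter values for a control group and a PD group.
u, p, effect = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

The normal approximation is rough for groups this small; exact tables or a statistics package would be preferable in practice.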
The glottal vibration parameters, i.e., jitter, shimmer, and SD of pitch, were significantly higher for people with PD than for CO, with p-value < 0.05. The HNR and NHR distributions show that the PD voice had higher noise (non-periodic) components than that of healthy people.
For vocal tract parameters, except for phonemes /o/ and /u/, the first three formants (F1, F2, and F3) of PD patients had a significantly higher standard deviation than those of the normal subjects. The majority of VTL parameters did not show significant differences between PD and normal subjects. The p-values and effect sizes confirm that, statistically, the means of the groups were not significantly different.

SVM classification. The SVM classification results for recordings from the PC-GITA database for the four groups of input parameters are shown in Table 4. It presents the accuracy, sensitivity, and specificity when considering each vowel independently and with the combination of the five vowels. For simplicity of presentation and without loss to the outcome of this work, the table only presents the results of the vowel combinations with significant accuracy. The results show that a classification accuracy of 84.3% was obtained with the combination of all the vowels when the SVM inputs were VTL(Fi); the overall observation is that VTL is the most effective feature for distinguishing between the voices of PD and CO. The SVM classification accuracy was 71.2% when it was given the ten highest-ranked features selected by the Relief-F algorithm. The ten highest-ranked features selected by Relief-F [...]

To evaluate the consistency of SVM classification using VTL(Fi) across databases, the SVM classification using VTL(Fi) was also applied to Viswanathan's dataset 38, which contains the recordings of /a/, /o/, and /m/. Table 5 provides the classification results for the recordings in this database. The table shows that the SVM classification using VTL(Fi) as input parameters performs consistently across the different databases. The highest accuracy was 96.0% with the combination of VTL(Fi) of /a/ and /m/, while an accuracy of 94.0% was obtained with the combination of /a/, /o/, and /m/.

Discussion
Several earlier studies have proposed the use of voice-based diagnosis and assessment of Parkinson's disease 16,18-22. These studies used vocal cord vibration parameters such as pitch frequency variation, number of pulses, jitter, shimmer, autocorrelation, and harmonics to noise ratio (HNR/NHR). While these studies [...]

This study has identified VTL as a potential parameter for the classification of PD patients based on sustained phoneme recordings. The VTL parameters achieved 84.3% accuracy, 84.0% sensitivity, and 84.7% specificity on the PC-GITA database with the five vowels /a/, /e/, /i/, /o/, and /u/. This study also showed the consistency of the parameters when applied to different datasets. Table 5 shows that, when applied to Viswanathan's dataset, VTL parameters could distinguish PD patients from healthy subjects with an accuracy of 96.0%.

This study has shown that, among the features reported in the literature, VTL features are the most suitable for differentiating the voice of people with PD from that of controls. VTL is an approximate measure of the physical vocal tract length while producing voice. The shape and length of the vocal tract affect the values and spacing of the formants. Longer vocal tracts produce lower, more closely spaced formants 36. Although the length of the vocal tract mainly depends on the physical body structure, the study of Pisanski et al. 37 found that a person may voluntarily modify the length of the vocal tract by up to 25%. The results reported in this paper indicate a possible relation between the modification of the vocal tract length by a subject and the presence of PD symptoms. When a PD patient, due to the reduced ability to control the speech muscles, modifies the length of the vocal tract, the properties of voice modulation in the vocal tract change. This relation is of higher order; the linear separation provided by the statistical test could not properly separate people with PD from healthy subjects.
The novelty of this study is the high performance in differentiating the voices of people with PD from those of controls, which was consistent across two different databases. This is the first study to investigate the use of VTL to identify the voices of people with PD, and it found that VTL parameters outperformed the features reported in the literature that are related to perturbation of glottal vibration, such as jitter, shimmer, pitch frequency, and harmonics ratio. The finding of this study supports the argument in 35 that the neurophysiological change in PD patients is manifested more in the change of vocal tract control than in glottal vibration or air pressure control by the lungs. This opens the potential for computerized and remote monitoring of people with PD.

The limitation of this study is that we have only investigated two databases: native speakers of Colombian Spanish and Australian native speakers. Further studies need to be conducted with people from other demographics and ethnicities to validate the findings for global use. While the size of the datasets is sufficient, larger datasets are required to allow the examination of the various confounding factors. There is also a need to investigate the effect of PD medication such as Levodopa on these parameters and to test this over repeated voice recordings.

Conclusion
This study has investigated the effectiveness of using three sets of voice features of sustained phonemes to differentiate people with PD from age-matched healthy participants using two independent and different sets of publicly available databases. It found that the most effective feature set was the apparent vocal tract length (VTL). The classification accuracy in identifying PD from control was 84.3% when combining the VTL features of all five vowels /a/, /e/, /i/, /o/, and /u/. The classification accuracy for /a/, /o/, and /m/ from the Viswanathan dataset, obtained using a smartphone, was 96%. This performance was significantly higher than the accuracy obtained when using the glottal vibration parameters (jitter, shimmer, pitch, and harmonics) and voice intensity. Another advantage of VTL parameters is that they were obtained automatically and are thus suitable for computerized analysis of voice recordings made with smartphones. Unlike deep-learning approaches, this method has the benefit of identifying the specific voice parameters, which allows the clinician to understand the differences. This has the potential for telephone-based diagnosis of PD.

Data availability
We have used publicly available datasets. The PC-GITA dataset is available on request from Orozco et al. (reference 23). The Viswanathan dataset is available from the contact author of reference 32.